Abstract

In this paper, we show the equivalence between mirror descent algorithms and algorithms
generalizing the conditional gradient method. This is done through convex duality, and
implies notably that for certain problems (such as the support vector machine), the primal
subgradient method and the dual conditional gradient method are formally equivalent. The
dual interpretation leads to a form of line search for mirror descent, as well as guarantees
of convergence for primal-dual certificates.

1 Introduction

Many problems in machine learning, statistics and signal processing may be cast as convex
optimization problems. In large-scale situations, simple gradient-based algorithms with
potentially many cheap iterations are often preferred over methods, such as Newton's method
or interior-point methods, that rely on fewer but more expensive iterations.

In this paper, we consider two classical algorithms, namely (a) subgradient descent and
its mirror descent extension [1, 2, 3], and (b) conditional gradient algorithms, sometimes
referred to as Frank-Wolfe algorithms [4, 5, 6, 7, 8].

Subgradient algorithms are adapted to non-smooth situations and have a convergence rate of
$O(t^{-1/2})$ in terms of objective values after $t$ steps. This rate improves to $O(t^{-1})$
when the objective function is strongly convex [9]. Conditional-gradient algorithms are
tailored to the optimization of smooth functions on a compact convex set, for which
minimizing linear functions is easy (but typically, orthogonal projections would be hard, so
that proximal methods [10, 11] cannot be used efficiently). They also have a convergence rate
of $O(1/t)$ [6]. The main results of this paper are (a) to show that these two sets of methods
are in fact equivalent by convex duality, (b) to recover a previously proposed extension of
the conditional gradient method which is more generally applicable [12], and (c) to provide
explicit convergence rates for primal and dual iterates.
More precisely, we consider a convex function $f$ defined on $\mathbb{R}^n$, a convex function
$h$ defined on $\mathbb{R}^p$, both potentially taking the value $+\infty$, and a linear
operator $A$ from $\mathbb{R}^p$ to $\mathbb{R}^n$. We consider the following minimization
problem, which we refer to as the primal problem:

$$\min_{x \in \mathbb{R}^p} \; h(x) + f(Ax). \tag{1}$$

Throughout this paper, we make the following assumptions regarding the problem: 



- $f$ is $B$-Lipschitz-continuous and finite on $\mathbb{R}^n$, i.e., for all
  $x, y \in \mathbb{R}^n$, $|f(x) - f(y)| \leq B\|x - y\|$, where $\|\cdot\|$ denotes the
  Euclidean norm. Note that this implies that the domain of the Fenchel conjugate $f^*$ is
  included in the ball of center $0$ and radius $B$. We denote by $C$ the compact domain of
  $f^*$. Thus, for all $z \in \mathbb{R}^n$, $f(z) = \max_{y \in C} \; y^\top z - f^*(y)$.

  Note that the compactness of the domain of $f^*$ is crucial and allows for simpler proof
  techniques with explicit constants (see a generalization in [12]).

- $h$ is lower-semicontinuous and $\mu$-strongly convex on $\mathbb{R}^p$. This implies that
  $h^*$ is defined on $\mathbb{R}^p$ and differentiable with $(1/\mu)$-Lipschitz-continuous
  gradient [13, 14]. Note that the domain $K$ of $h$ may be strictly included in
  $\mathbb{R}^p$.

Moreover, we assume that the following quantities may be computed efficiently: 

- Subgradient of $f$: for any $z \in \mathbb{R}^n$, a subgradient of $f$ is any maximizer $y$
  of $\max_{y \in C} \; y^\top z - f^*(y)$.

- Gradient of $h^*$: for any $z \in \mathbb{R}^p$, $(h^*)'(z)$ may be computed and is equal to
  the unique maximizer $x$ of $\max_{x \in \mathbb{R}^p} \; x^\top z - h(x)$.

The values of the functions $f$, $h$, $f^*$ and $h^*$ are useful to compute duality gaps.
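
As a small illustration of the subgradient oracle above, the following Python sketch uses the
least-absolute-deviation loss $f(z) = \frac{1}{n}\sum_i |z_i - b_i|$, for which
$f^*(y) = b^\top y$ on $C = \{y : |y_i| \leq 1/n\}$. The function names and the toy vectors
`z` and `b` are hypothetical and chosen only for illustration; the check is that a maximizer
of $y^\top z - f^*(y)$ over $C$ attains the value $f(z)$.

```python
def lad_loss(z, b):
    """f(z) = (1/n) * sum_i |z_i - b_i|."""
    return sum(abs(zi - bi) for zi, bi in zip(z, b)) / len(z)


def lad_conjugate(y, b):
    """f*(y) = b^T y on C = {y : |y_i| <= 1/n} (membership in C is not checked)."""
    return sum(yi * bi for yi, bi in zip(y, b))


def lad_subgradient(z, b):
    """One maximizer of y^T z - f*(y) over C, namely y_i = sign(z_i - b_i) / n."""
    n = len(z)
    return [((zi > bi) - (zi < bi)) / n for zi, bi in zip(z, b)]


# Hypothetical toy data (illustration only).
z = [0.5, -1.0, 2.0]
b = [1.0, -1.0, 0.0]

y = lad_subgradient(z, b)
# Value of y^T z - f*(y) at the maximizer; it should coincide with f(z).
attained = sum(yi * zi for yi, zi in zip(y, z)) - lad_conjugate(y, b)
print(attained, lad_loss(z, b))
```

The same pattern applies to any loss whose conjugate is available in closed form: the
subgradient oracle is just the maximizer in the Fenchel representation of $f$.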

We denote by $g_{\mathrm{primal}}(x) = h(x) + f(Ax)$ the primal objective in Eq. (1). It is
the sum of a Lipschitz-continuous convex function and a strongly convex function, potentially
on a restricted domain $K$. It is thus well adapted to the subgradient method.

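We have the following primal/dual relationships (obtained from Fenchel duality [13]):

$$\begin{aligned}
\min_{x \in \mathbb{R}^p} \; h(x) + f(Ax)
 &= \min_{x \in \mathbb{R}^p} \max_{y \in C} \; h(x) + y^\top (Ax) - f^*(y) \\
 &= \max_{y \in C} \Big\{ \min_{x \in \mathbb{R}^p} h(x) + x^\top A^\top y \Big\} - f^*(y) \\
 &= \max_{y \in C} \; -h^*(-A^\top y) - f^*(y).
\end{aligned}$$

This leads to the dual maximization problem:

$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y). \tag{2}$$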
We denote by $g_{\mathrm{dual}}(y) = -h^*(-A^\top y) - f^*(y)$ the dual objective. It has a
smooth part $-h^*(-A^\top y)$ defined on $\mathbb{R}^n$ and a potentially non-smooth part
$-f^*(y)$, and the problem is restricted onto a compact set $C$. When $f^*$ is linear (and
more generally smooth) on its support, then we are exactly in the situation where conditional
gradient algorithms may be used.

Given a pair of primal-dual candidates $(x, y) \in K \times C$, we denote by
$\mathrm{gap}(x, y)$ the duality gap:

$$\mathrm{gap}(x, y) = g_{\mathrm{primal}}(x) - g_{\mathrm{dual}}(y)
 = \big[ h(x) + h^*(-A^\top y) + y^\top A x \big] + \big[ f(Ax) + f^*(y) - y^\top A x \big].$$

This quantity serves as a certificate of optimality, as

$$\mathrm{gap}(x, y) = \Big[ g_{\mathrm{primal}}(x) - \min_{x' \in K} g_{\mathrm{primal}}(x') \Big]
 + \Big[ \max_{y' \in C} g_{\mathrm{dual}}(y') - g_{\mathrm{dual}}(y) \Big].$$
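
The two brackets of the duality gap can be evaluated separately. The sketch below does so for
the particular (assumed) choice $h(x) = \frac{\mu}{2}\|x\|^2$, for which
$h^*(w) = \frac{1}{2\mu}\|w\|^2$, and for the averaged hinge loss with labels `b`, for which
$f^*(y) = b^\top y$ on $C = \{y : -1/n \leq b_i y_i \leq 0\}$. All names and the toy data are
hypothetical; the point is only that both brackets are nonnegative (Fenchel-Young), and that
the first one vanishes when $x = (h^*)'(-A^\top y)$.

```python
def matvec(A, x):
    """Return A x for a matrix A stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y for a matrix A stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def gap_terms(A, b, mu, x, y):
    """Two brackets of gap(x, y); assumes y lies in C, where f*(y) = b^T y."""
    z, w = matvec(A, x), rmatvec(A, y)
    h = 0.5 * mu * sum(v * v for v in x)
    h_conj = sum(v * v for v in w) / (2.0 * mu)        # h*(-A^T y)
    f = sum(max(1.0 - bi * zi, 0.0) for zi, bi in zip(z, b)) / len(b)
    f_conj = sum(bi * yi for bi, yi in zip(b, y))      # f*(y) on C
    coupling = sum(yi * zi for yi, zi in zip(y, z))    # y^T A x
    return h + h_conj + coupling, f + f_conj - coupling


# Hypothetical toy data: n = 3 observations, p = 2 variables.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, -1.0, 1.0]
mu = 1.0
alpha = [1.0, 0.5, 0.0]                                # any point of [0, 1]^n
y = [-a * bi / len(b) for a, bi in zip(alpha, b)]      # a point of C
x = [-v / mu for v in rmatvec(A, y)]                   # x = (h*)'(-A^T y)

first, second = gap_terms(A, b, mu, x, y)
gap = first + second
print(first, second, gap)
```

With this choice of $x$ the gap reduces to its second bracket, which is the quantity tracked
by the conditional gradient analysis of Section 4.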

2 Examples 

Typical cases for h (often the regularizer in machine learning and signal processing) are the 
following: 

- Squared Euclidean norm: $h(x) = \frac{\mu}{2}\|x\|^2$, which is $\mu$-strongly convex.

- Squared Euclidean norm with convex constraints: $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, with
  $I_K$ the indicator function for $K$ a convex set, which is $\mu$-strongly convex.

- Negative entropy: $h(x) = \sum_{i=1}^p x_i \log x_i + I_S(x)$, where
  $S = \{x \in \mathbb{R}^p, \ x \geq 0, \ \sum_{i=1}^p x_i = 1\}$, which is 1-strongly convex.
  More generally, many barrier functions of convex sets may be used (see examples in [3, 15]).

Typical cases for $f$ (often the data fitting terms in machine learning and signal processing)
are functions of the form $f(z) = \frac{1}{n}\sum_{i=1}^n \ell_i(z_i)$:

- Least-absolute-deviation: $\ell_i(z_i) = |z_i - y_i|$, with $y_i \in \mathbb{R}$. Note that
  the square loss is not Lipschitz-continuous on $\mathbb{R}^n$ (but it is, when restricted to
  a compact set).

- Logistic regression: $\ell_i(z_i) = \log(1 + \exp(-z_i y_i))$, with $y_i \in \{-1, 1\}$. Here
  $f^*$ is not linear on its support, and $f^*$ is not smooth, since it is a sum of negative
  entropies. This extends to any negative exponential family log-likelihood. Note that $f$ is
  then smooth and proximal methods with an exponential convergence rate may be used (which
  correspond to a constant step size in the algorithms presented below, instead of a decaying
  step size).

- Support vector machine: $\ell_i(z_i) = \max\{1 - y_i z_i, 0\}$, with $y_i \in \{-1, 1\}$.
  Here $f^*$ is linear on its domain (this is a situation where subgradient and conditional
  gradient methods are exactly equivalent). This extends to more general max-margin
  formulations [16].
Other examples may be found in signal processing; for example, total-variation denoising,
where the loss is strongly convex but the regularizer is non-smooth [17], or submodular
function minimization cast through separable optimization problems [18].



3 Mirror descent for strongly convex problems 

We first assume that the function $h$ is essentially smooth (i.e., differentiable at any point
in the interior of $K$, and such that the norm of gradients converges to $+\infty$ when
approaching the boundary of $K$); then $h'$ is a bijection from $\mathrm{int}(K)$ to
$\mathbb{R}^p$, where $K$ is the domain of $h$ (see, e.g., [14]). We consider the Bregman
divergence

$$D(x_1, x_2) = h(x_1) - h(x_2) - (x_1 - x_2)^\top h'(x_2).$$

It is always defined on $K \times \mathrm{int}(K)$, and is nonnegative. If
$x_1, x_2 \in \mathrm{int}(K)$, then $D(x_1, x_2) = 0$ if and only if $x_1 = x_2$. See more
details in [3]. For example, when $h(x) = \frac{\mu}{2}\|x\|^2$, we have
$D(x_1, x_2) = \frac{\mu}{2}\|x_1 - x_2\|^2$.
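
The definition of the Bregman divergence can be checked numerically on the two regularizers
mentioned in Section 2. The sketch below (hypothetical helper names and test points) evaluates
$D$ for the squared Euclidean norm, where it should equal $\frac{\mu}{2}\|x_1 - x_2\|^2$, and
for the negative entropy on the simplex, where it is the Kullback-Leibler divergence.

```python
import math


def bregman(h, grad_h, x1, x2):
    """D(x1, x2) = h(x1) - h(x2) - (x1 - x2)^T h'(x2)."""
    g = grad_h(x2)
    return h(x1) - h(x2) - sum((a - c) * gi for a, c, gi in zip(x1, x2, g))


mu = 2.0  # hypothetical strong-convexity constant


def sq_norm(x):
    return 0.5 * mu * sum(v * v for v in x)


def sq_norm_grad(x):
    return [mu * v for v in x]


def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


# Two points in the interior of the simplex (illustration only).
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

d_sq = bregman(sq_norm, sq_norm_grad, p, q)
d_ent = bregman(neg_entropy, neg_entropy_grad, p, q)
kl = sum(a * math.log(a / c) for a, c in zip(p, q))
print(d_sq, d_ent, kl)
```

The entropy case only matches the Kullback-Leibler divergence because both points sum to one;
for general positive vectors an additional linear term $\sum_i (x_{2,i} - x_{1,i})$ appears.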

Subgradient descent for square Bregman divergence. When $h(x) = \frac{\mu}{2}\|x\|^2$, the
primal problem becomes:

$$\min_{x \in \mathbb{R}^p} \; f(Ax) + \frac{\mu}{2}\|x\|^2.$$

The projected subgradient method starts from $x_0 \in \mathbb{R}^p$, and iterates the
following recursion:

$$x_t = x_{t-1} - \frac{\rho_t}{\mu} \big[ A^\top f'(Ax_{t-1}) + \mu x_{t-1} \big],$$

where $f'(Ax_{t-1})$ is any subgradient of $f$ at $Ax_{t-1}$. The step size is $\rho_t/\mu$.
The recursion may be rewritten as

$$\mu x_t = \mu x_{t-1} - \rho_t \big[ A^\top f'(Ax_{t-1}) + \mu x_{t-1} \big],$$

which is equivalent to the minimization of

$$(x - x_{t-1})^\top \big[ A^\top \bar{y}_{t-1} + \mu x_{t-1} \big] + \frac{\mu}{2\rho_t}\|x - x_{t-1}\|^2,$$

with $\bar{y}_{t-1} = f'(Ax_{t-1})$, which is the traditional proximal step, with step size
$\rho_t/\mu$.

Mirror descent. We may interpret the last formulation as the minimization of

$$(x - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) + \frac{1}{\rho_t} D(x, x_{t-1}),$$

with solution defined through (note that $h'$ is a bijection from $\mathrm{int}(K)$ to
$\mathbb{R}^p$):

$$\begin{aligned}
h'(x_t) &= h'(x_{t-1}) - \rho_t \big[ A^\top f'(Ax_{t-1}) + h'(x_{t-1}) \big] \\
        &= (1 - \rho_t) h'(x_{t-1}) - \rho_t A^\top f'(Ax_{t-1}).
\end{aligned}$$

Thus, we now define the mirror descent recursion as follows:

$$\begin{aligned}
\bar{y}_{t-1} &\in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
x_t &= \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top \bar{y}_{t-1}.
\end{aligned} \tag{3}$$
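
A minimal sketch of the recursion above is given below in the Euclidean case
$h(x) = \frac{\mu}{2}\|x\|^2$ (so that $h'(x) = \mu x$ and the update is explicit), with the
averaged hinge loss of Section 2. The helper names, the step size $\rho_t = 2/(t+1)$ and the
toy data are assumptions made for illustration; labels are stored in `b` to avoid a clash with
the dual variable $y$.

```python
def matvec(A, x):
    """Return A x for a matrix A stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y for a matrix A stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def hinge_subgradient(z, b):
    """One maximizer of y^T z - f*(y) over C for the averaged hinge loss."""
    n = len(z)
    return [-bi / n if 1.0 - bi * zi > 0.0 else 0.0 for zi, bi in zip(z, b)]


def primal(A, b, mu, x):
    """g_primal(x) = (mu/2)||x||^2 + (1/n) sum_i max(1 - b_i (Ax)_i, 0)."""
    z = matvec(A, x)
    loss = sum(max(1.0 - bi * zi, 0.0) for zi, bi in zip(z, b)) / len(b)
    return 0.5 * mu * sum(v * v for v in x) + loss


def mirror_descent(A, b, mu, x0, num_steps):
    """Return the iterates x_0, ..., x_T with rho_t = 2 / (t + 1)."""
    xs = [list(x0)]
    for t in range(1, num_steps + 1):
        rho = 2.0 / (t + 1)
        x_prev = xs[-1]
        y_bar = hinge_subgradient(matvec(A, x_prev), b)
        w = rmatvec(A, y_bar)
        # h'(x_t) = (1 - rho) h'(x_{t-1}) - rho A^T y_bar, with h'(x) = mu x.
        xs.append([(1.0 - rho) * xp - rho * wj / mu for xp, wj in zip(x_prev, w)])
    return xs


# Hypothetical toy data: n = 3 observations, p = 2 variables.
A = [[1.0, 2.0], [-1.5, 0.5], [0.3, -0.7]]
b = [1.0, -1.0, 1.0]
mu = 0.5
xs = mirror_descent(A, b, mu, [0.0, 0.0], 100)
best = min(primal(A, b, mu, x) for x in xs)
print(best)
```

For another regularizer $h$, only the last line of the loop changes: the update is carried out
on $h'(x_t)$ and mapped back through $(h^*)'$.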

Proposition 1 (Convergence of mirror descent in the strongly convex case) Assume
that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of
$f^*$, (b) $h$ is essentially smooth and $\mu$-strongly convex. Consider $\rho_t = 2/(t+1)$
and $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $x_*$ the unique minimizer of
$g_{\mathrm{primal}}$, after $t$ iterations of the mirror descent recursion of Eq. (3), we
have:

$$\begin{aligned}
g_{\mathrm{primal}}\Big( \frac{2}{t(t+1)} \sum_{u=1}^t u\, x_{u-1} \Big) - g_{\mathrm{primal}}(x_*) &\leq \frac{2R^2}{\mu(t+1)}, \\
\min_{u \in \{0, \dots, t-1\}} \big\{ g_{\mathrm{primal}}(x_u) - g_{\mathrm{primal}}(x_*) \big\} &\leq \frac{2R^2}{\mu(t+1)}, \\
D(x_*, x_t) &\leq \frac{2R^2}{\mu(t+1)}.
\end{aligned}$$



Proof We follow the proof of [3] and adapt it to the strongly convex case. Denoting
$g'_{\mathrm{primal}}(x_{t-1}) = A^\top f'(Ax_{t-1}) + h'(x_{t-1})$, we have:

$$\begin{aligned}
& D(x_*, x_t) - D(x_*, x_{t-1}) \\
&= h(x_{t-1}) - h(x_t) - (x_* - x_t)^\top h'(x_t) + (x_* - x_{t-1})^\top h'(x_{t-1}) \\
&= h(x_{t-1}) - h(x_t) - (x_* - x_t)^\top \big[ (1 - \rho_t) h'(x_{t-1}) - \rho_t A^\top f'(Ax_{t-1}) \big] + (x_* - x_{t-1})^\top h'(x_{t-1}) \\
&= h(x_{t-1}) - h(x_t) - (x_{t-1} - x_t)^\top h'(x_{t-1}) + \rho_t (x_* - x_t)^\top g'_{\mathrm{primal}}(x_{t-1}) \\
&= \big[ -D(x_t, x_{t-1}) + \rho_t (x_{t-1} - x_t)^\top g'_{\mathrm{primal}}(x_{t-1}) \big] + \big[ \rho_t (x_* - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) \big].
\end{aligned} \tag{4}$$

In order to upper-bound the two terms in Eq. (4), we first consider the following bound
(obtained by convexity of $f$):

$$f(Ax_*) + h(x_*) \geq f(Ax_{t-1}) + h(x_{t-1}) + (x_* - x_{t-1})^\top \big[ A^\top \bar{y}_{t-1} + h'(x_{t-1}) \big] + D(x_*, x_{t-1}),$$

which may be rewritten as:

$$g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \leq -D(x_*, x_{t-1}) + (x_{t-1} - x_*)^\top g'_{\mathrm{primal}}(x_{t-1}),$$

which implies

$$\rho_t (x_* - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) \leq -\rho_t D(x_*, x_{t-1}) - \rho_t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big]. \tag{5}$$

Moreover,

$$-D(x_t, x_{t-1}) + \rho_t (x_{t-1} - x_t)^\top g'_{\mathrm{primal}}(x_{t-1})
 = \max_{x \in \mathbb{R}^p} \; -D(x, x_{t-1}) + (x_{t-1} - x)^\top z = \varphi(z),$$

with $z = \rho_t g'_{\mathrm{primal}}(x_{t-1})$. The function $x \mapsto D(x, x_{t-1})$ is
$\mu$-strongly convex, and its Fenchel conjugate is thus $(1/\mu)$-smooth. This implies that
$\varphi$ is $(1/\mu)$-smooth. Since $\varphi(0) = 0$ and $\varphi'(0) = 0$,
$\varphi(z) \leq \frac{1}{2\mu}\|z\|^2$. Moreover,
$z = \rho_t \big( A^\top f'(Ax_{t-1}) + h'(x_{t-1}) \big)$. Since $h'(x_{t-1}) \in -A^\top C$
(because $h'(x_{t-1})$ is a convex combination of such elements), then
$\|A^\top f'(Ax_{t-1}) + h'(x_{t-1})\|^2 \leq R^2 = \max_{y_1, y_2 \in C} \|A^\top (y_1 - y_2)\|^2 = \mathrm{diam}(A^\top C)^2$.

Overall, combining Eq. (5) and $\varphi(z) \leq \frac{R^2 \rho_t^2}{2\mu}$ into Eq. (4), this
implies that

$$D(x_*, x_t) - D(x_*, x_{t-1}) \leq \frac{\rho_t^2}{2\mu} R^2 - \rho_t D(x_*, x_{t-1}) - \rho_t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big],$$

that is,

$$g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \leq \frac{\rho_t R^2}{2\mu} + (\rho_t^{-1} - 1) D(x_*, x_{t-1}) - \rho_t^{-1} D(x_*, x_t).$$

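With $\rho_t = \frac{2}{t+1}$, we obtain

$$t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big] \leq \frac{R^2}{\mu} \frac{t}{t+1} + \frac{(t-1)t}{2} D(x_*, x_{t-1}) - \frac{t(t+1)}{2} D(x_*, x_t).$$

Thus, by summing from $u = 1$ to $u = t$ (and using $\frac{u}{u+1} \leq 1$), we obtain

$$\sum_{u=1}^t u \big[ g_{\mathrm{primal}}(x_{u-1}) - g_{\mathrm{primal}}(x_*) \big] \leq \frac{R^2}{\mu}\, t - \frac{t(t+1)}{2} D(x_*, x_t),$$

that is,

$$D(x_*, x_t) + \frac{2}{t(t+1)} \sum_{u=1}^t u \big[ g_{\mathrm{primal}}(x_{u-1}) - g_{\mathrm{primal}}(x_*) \big] \leq \frac{2R^2}{\mu(t+1)}.$$

This implies that (a) $D(x_*, x_t) \leq \frac{2R^2}{\mu(t+1)}$, i.e., the iterates converge,
and (b), by convexity of $g_{\mathrm{primal}}$,

$$g_{\mathrm{primal}}\Big( \frac{2}{t(t+1)} \sum_{u=1}^t u\, x_{u-1} \Big) - g_{\mathrm{primal}}(x_*) \leq \frac{2R^2}{\mu(t+1)},$$

i.e., the objective function at an averaged iterate converges, and (c)

$$\min_{u \in \{0, \dots, t-1\}} g_{\mathrm{primal}}(x_u) - g_{\mathrm{primal}}(x_*) \leq \frac{2R^2}{\mu(t+1)},$$

i.e., one of the iterates has an objective that converges.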
Averaging. Note that with the step size $\rho_t = \frac{2}{t+1}$, we have

$$h'(x_t) = \frac{t-1}{t+1} h'(x_{t-1}) - \frac{2}{t+1} A^\top f'(Ax_{t-1}),$$

which implies

$$t(t+1) h'(x_t) = (t-1)t\, h'(x_{t-1}) - 2t A^\top f'(Ax_{t-1}).$$

By summing these equalities, we obtain
$t(t+1) h'(x_t) = -2 \sum_{u=1}^t u A^\top f'(Ax_{u-1})$, i.e.,

$$h'(x_t) = \frac{2}{t(t+1)} \sum_{u=1}^t u \big[ -A^\top f'(Ax_{u-1}) \big],$$

that is, $h'(x_t)$ is a weighted average of subgradients.
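
The averaging identity is purely algebraic and can be checked on arbitrary vectors. In the
sketch below, the list `s` holds hypothetical stand-ins for the vectors
$A^\top f'(Ax_{u-1})$ and `w` plays the role of $h'(x_t)$; both names, the dimensions and the
random data are assumptions for illustration only.

```python
import random

random.seed(0)
p, T = 3, 25

# Stand-ins for A^T f'(A x_{u-1}), u = 1, ..., T (any sequence of vectors works).
s = [[random.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(T)]

# Arbitrary starting value for h'(x_0); it is forgotten at t = 1 since rho_1 = 1.
w = [random.uniform(-1.0, 1.0) for _ in range(p)]
for t in range(1, T + 1):
    rho = 2.0 / (t + 1)
    w = [(1.0 - rho) * wj - rho * sj for wj, sj in zip(w, s[t - 1])]

# Closed form: weighted average of the negated vectors with weights 2u / (T (T + 1)).
avg = [
    2.0 / (T * (T + 1)) * sum(u * (-s[u - 1][j]) for u in range(1, T + 1))
    for j in range(p)
]
max_err = max(abs(a - c) for a, c in zip(w, avg))
print(max_err)
```

Later subgradients receive linearly larger weights, which is what removes the dependence on
the starting point.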

Generalization to $h$ non-smooth. The previous result does not require $h$ to be essentially
smooth, i.e., it may be applied to $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$ where $K$ is a convex
set strictly included in $\mathbb{R}^p$. In the mirror descent recursion,

$$\begin{aligned}
\bar{y}_{t-1} &\in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
x_t &= \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top \bar{y}_{t-1},
\end{aligned}$$

there may then be multiple choices for $h'(x_{t-1})$. If we choose for $h'(x_{t-1})$ at
iteration $t$ the subgradient of $h$ obtained at the previous iteration, i.e., such that
$h'(x_{t-1}) = (1 - \rho_{t-1}) h'(x_{t-2}) - \rho_{t-1} A^\top \bar{y}_{t-2}$, then Prop. 1
above holds.

Note that when $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, the algorithm above is not equivalent to
projected gradient descent. Indeed, the classical algorithm has the iteration

$$x_t = \Pi_K\Big( x_{t-1} - \frac{\rho_t}{\mu} \big[ \mu x_{t-1} + A^\top f'(Ax_{t-1}) \big] \Big)
     = \Pi_K\Big( (1 - \rho_t) x_{t-1} + \rho_t \Big[ -\frac{1}{\mu} A^\top f'(Ax_{t-1}) \Big] \Big)$$

and corresponds to the choice $h'(x_{t-1}) = \mu x_{t-1}$ in the mirror descent recursion,
which, when $x_{t-1}$ is in the boundary of $K$, is not the choice that we need for the
equivalence.



4 Conditional gradient method and extensions 

Conditional gradient method. Given a maximization problem of the following form (i.e., where
$f^*$ is zero on its domain)

$$\max_{y \in C} \; -h^*(-A^\top y),$$

the conditional gradient algorithm consists in the following iteration (note that below
$Ax_{t-1} = A (h^*)'(-A^\top y_{t-1})$ is the gradient of the objective function):

$$\begin{aligned}
x_{t-1} &= \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1}, \\
\bar{y}_{t-1} &\in \arg\max_{y \in C} \; y^\top A x_{t-1}, \\
y_t &= (1 - \rho_t) y_{t-1} + \rho_t \bar{y}_{t-1}.
\end{aligned}$$

It corresponds to a linearization of $-h^*(-A^\top y)$ and its maximization over the compact
convex set $C$. As we show later, the choice of $\rho_t$ may be done in different ways, through
a fixed step size or by (approximate) line search.

Generalization. Following [12], the conditional gradient method can be generalized to
problems of the form

$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y),$$

with the following iteration:

$$\begin{aligned}
x_{t-1} &= \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1} = (h^*)'(-A^\top y_{t-1}), \\
\bar{y}_{t-1} &\in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
y_t &= (1 - \rho_t) y_{t-1} + \rho_t \bar{y}_{t-1}.
\end{aligned} \tag{6}$$

The previous algorithm may be interpreted as follows: (a) perform a first-order Taylor
expansion of the smooth part $-h^*(-A^\top y)$, while leaving the other part $-f^*(y)$ intact,
(b) maximize the approximation, and (c) perform a small step towards the maximizer. Note
the similarity (and dissimilarity) with proximal methods, which would add a proximal term
proportional to $\|y - y_{t-1}\|^2$, leading to faster convergence, but with the extra
requirement of solving the proximal step [10, 11].
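
The generalized iteration can be sketched as follows for the (assumed) instance
$h(x) = \frac{\mu}{2}\|x\|^2$ with the averaged hinge loss, for which $f^*(y) = b^\top y$ on
$C = \{y : -1/n \leq b_i y_i \leq 0\}$. The helper names, the toy data and the brute-force
computation of $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$ over the vertices of the box
$[-1/n, 1/n]^n$ are illustrative assumptions; the fixed step size $\rho_t = 2/(t+1)$ is used
and the duality gap is recorded along the way.

```python
from itertools import product


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def primal(A, b, mu, x):
    z = matvec(A, x)
    loss = sum(max(1.0 - bi * zi, 0.0) for zi, bi in zip(z, b)) / len(b)
    return 0.5 * mu * sum(v * v for v in x) + loss


def dual(A, b, mu, y):
    w = rmatvec(A, y)
    return -sum(v * v for v in w) / (2.0 * mu) - sum(bi * yi for bi, yi in zip(b, y))


def generalized_cg(A, b, mu, y0, num_steps):
    """Return the last dual iterate and the gaps gap(x_u, y_u), u = 0, ..., T - 1."""
    n, y, gaps = len(b), list(y0), []
    for t in range(1, num_steps + 1):
        x = [-v / mu for v in rmatvec(A, y)]           # x_{t-1} = (h*)'(-A^T y_{t-1})
        gaps.append(primal(A, b, mu, x) - dual(A, b, mu, y))
        z = matvec(A, x)
        # Maximizer of y^T A x_{t-1} - f*(y) over C (linear problem over a box).
        y_bar = [-bi / n if 1.0 - bi * zi > 0.0 else 0.0 for zi, bi in zip(z, b)]
        rho = 2.0 / (t + 1)
        y = [(1.0 - rho) * yi + rho * yb for yi, yb in zip(y, y_bar)]
    return y, gaps


# Hypothetical toy data: n = 3 observations, p = 2 variables.
A = [[1.0, 2.0], [-1.5, 0.5], [0.3, -0.7]]
b = [1.0, -1.0, 1.0]
mu, T = 0.5, 100
n = len(b)
R2 = max(
    sum(v * v for v in rmatvec(A, [s / n for s in signs]))
    for signs in product((-1.0, 1.0), repeat=n)
)
y_final, gaps = generalized_cg(A, b, mu, [0.0] * n, T)
print(min(gaps), R2)
```

Each iteration only requires a linear maximization over $C$ and one evaluation of $(h^*)'$; no
projection or proximal step is needed.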

When $h$ is essentially smooth (and thus $h^*$ is essentially strictly convex), it can be
reformulated with $h'(x_t) = -A^\top y_t$ as follows:

$$\begin{aligned}
h'(x_t) &= (1 - \rho_t) h'(x_{t-1}) - \rho_t A^\top \arg\max_{y \in C} \big\{ y^\top A x_{t-1} - f^*(y) \big\} \\
        &= (1 - \rho_t) h'(x_{t-1}) - \rho_t A^\top f'(Ax_{t-1}),
\end{aligned}$$

which is exactly the mirror descent algorithm described in Eq. (3). This leads to the
following proposition:

Proposition 2 (Equivalence between mirror descent and generalized conditional gradient)
Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain
of $f^*$, (b) $h$ is $\mu$-strongly convex and essentially smooth. The mirror descent
recursion in Eq. (3), started from $x_0 = (h^*)'(-A^\top y_0)$, is equivalent to the
generalized conditional gradient recursion in Eq. (6), started from $y_0 \in C$.
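
The equivalence can be observed numerically by running the two recursions side by side. The
sketch below is an illustration under assumptions: it takes $h(x) = \frac{\mu}{2}\|x\|^2$ and
the logistic loss (chosen here because the maximizer defining $\bar{y}_{t-1}$ is then unique,
so no tie-breaking rule is involved), uses $\rho_t = 2/(t+1)$, and relies on hypothetical
helper names and toy data. It checks that $x_t = -A^\top y_t / \mu$ along the whole trajectory.

```python
import math


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def logistic_gradient(z, b):
    """Gradient of f(z) = (1/n) sum_i log(1 + exp(-b_i z_i)), i.e. the unique y_bar."""
    n = len(z)
    return [-bi / (n * (1.0 + math.exp(bi * zi))) for zi, bi in zip(z, b)]


def mirror_descent(A, b, mu, x0, num_steps):
    """Primal recursion: mu x_t = (1 - rho) mu x_{t-1} - rho A^T y_bar."""
    xs = [list(x0)]
    for t in range(1, num_steps + 1):
        rho = 2.0 / (t + 1)
        w = rmatvec(A, logistic_gradient(matvec(A, xs[-1]), b))
        xs.append([(1.0 - rho) * xp - rho * wj / mu for xp, wj in zip(xs[-1], w)])
    return xs


def generalized_cg(A, b, mu, y0, num_steps):
    """Dual recursion: y_t = (1 - rho) y_{t-1} + rho y_bar, with x = -A^T y / mu."""
    ys = [list(y0)]
    for t in range(1, num_steps + 1):
        rho = 2.0 / (t + 1)
        x_prev = [-v / mu for v in rmatvec(A, ys[-1])]
        y_bar = logistic_gradient(matvec(A, x_prev), b)
        ys.append([(1.0 - rho) * yp + rho * yb for yp, yb in zip(ys[-1], y_bar)])
    return ys


# Hypothetical toy data; y0 satisfies -1/n <= b_i y_i <= 0.
A = [[1.0, 2.0], [-1.5, 0.5], [0.3, -0.7]]
b = [1.0, -1.0, 1.0]
mu, T = 0.5, 20
y0 = [-0.1, 0.2, 0.0]
x0 = [-v / mu for v in rmatvec(A, y0)]

xs = mirror_descent(A, b, mu, x0, T)
ys = generalized_cg(A, b, mu, y0, T)
max_diff = max(
    abs(xj + wj / mu)
    for x, y in zip(xs, ys)
    for xj, wj in zip(x, rmatvec(A, y))
)
print(max_diff)
```

With a non-smooth loss such as the hinge, the same check goes through provided both runs use
the same rule to select a maximizer when it is not unique.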

When h is not essentially smooth, then with a particular choice of subgradient, the two al- 
gorithms are also equivalent. We now provide convergence proofs for the two versions (with 
adaptive and non-adaptive step sizes); similar rates may be obtained without the compact- 
ness assumptions [12], but our results provide explicit constants and primal-dual guarantees. 
We first have the following convergence proof for generalized conditional gradient with no 
line search: 
Proposition 3 (Convergence of extended conditional gradient - no line search)
Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain
of $f^*$, (b) $h$ is $\mu$-strongly convex. Consider $\rho_t = 2/(t+1)$ and
$R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $y_*$ any maximizer of
$g_{\mathrm{dual}}$ on $C$, after $t$ iterations of the generalized conditional gradient
recursion of Eq. (6), we have:

$$\begin{aligned}
g_{\mathrm{dual}}(y_*) - g_{\mathrm{dual}}(y_t) &\leq \frac{2R^2}{\mu(t+1)}, \\
\min_{u \in \{0, \dots, t-1\}} \mathrm{gap}(x_u, y_u) &\leq \frac{8R^2}{\mu(t+1)}.
\end{aligned}$$

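Proof We have (using convexity of $f^*$ and $(1/\mu)$-smoothness of $h^*$):

$$\begin{aligned}
g_{\mathrm{dual}}(y_t) &= -h^*(-A^\top y_t) - f^*(y_t) \\
&\geq \Big[ -h^*(-A^\top y_{t-1}) + (y_t - y_{t-1})^\top A x_{t-1} - \frac{R^2 \rho_t^2}{2\mu} \Big] - \Big[ (1 - \rho_t) f^*(y_{t-1}) + \rho_t f^*(\bar{y}_{t-1}) \Big] \\
&= -h^*(-A^\top y_{t-1}) + \rho_t (\bar{y}_{t-1} - y_{t-1})^\top A x_{t-1} - \frac{R^2 \rho_t^2}{2\mu} - (1 - \rho_t) f^*(y_{t-1}) - \rho_t f^*(\bar{y}_{t-1}) \\
&= g_{\mathrm{dual}}(y_{t-1}) + \rho_t (\bar{y}_{t-1} - y_{t-1})^\top A x_{t-1} - \frac{R^2 \rho_t^2}{2\mu} + \rho_t f^*(y_{t-1}) - \rho_t f^*(\bar{y}_{t-1}) \\
&= g_{\mathrm{dual}}(y_{t-1}) - \frac{R^2 \rho_t^2}{2\mu} + \rho_t \Big[ f^*(y_{t-1}) - f^*(\bar{y}_{t-1}) + (\bar{y}_{t-1} - y_{t-1})^\top A x_{t-1} \Big] \\
&= g_{\mathrm{dual}}(y_{t-1}) - \frac{R^2 \rho_t^2}{2\mu} + \rho_t \Big[ f^*(y_{t-1}) - y_{t-1}^\top A x_{t-1} - \big( f^*(\bar{y}_{t-1}) - \bar{y}_{t-1}^\top A x_{t-1} \big) \Big].
\end{aligned}$$

Note that by definition of $\bar{y}_{t-1}$, we have (by equality in Fenchel-Young inequality)
$-f^*(\bar{y}_{t-1}) + \bar{y}_{t-1}^\top A x_{t-1} = f(Ax_{t-1})$, and
$h^*(-A^\top y_{t-1}) + h(x_{t-1}) + x_{t-1}^\top A^\top y_{t-1} = 0$, and thus

$$f^*(y_{t-1}) - y_{t-1}^\top A x_{t-1} - \big( f^*(\bar{y}_{t-1}) - \bar{y}_{t-1}^\top A x_{t-1} \big)
 = g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{dual}}(y_{t-1}) = \mathrm{gap}(x_{t-1}, y_{t-1}).$$

We thus obtain, for any $\rho_t \in [0, 1]$:

$$g_{\mathrm{dual}}(y_t) - g_{\mathrm{dual}}(y_*) \geq g_{\mathrm{dual}}(y_{t-1}) - g_{\mathrm{dual}}(y_*) + \rho_t\, \mathrm{gap}(x_{t-1}, y_{t-1}) - \frac{R^2 \rho_t^2}{2\mu},$$

which is the classical equation from the conditional gradient algorithm [6, 7, 8], which we
can analyze through Lemma 1 (see end of this section), applied with
$u_t = g_{\mathrm{dual}}(y_*) - g_{\mathrm{dual}}(y_t)$, $v_t = \mathrm{gap}(x_t, y_t)$ and
$A = R^2/\mu$. ■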



Proposition 4 (Convergence of extended conditional gradient - with line search)
Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain
of $f^*$, (b) $h$ is $\mu$-strongly convex. Consider
$\rho_t = \min\{\frac{\mu}{R^2} \mathrm{gap}(x_{t-1}, y_{t-1}), 1\}$ and
$R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $y_*$ any maximizer of
$g_{\mathrm{dual}}$ on $C$, after $t$ iterations of the generalized conditional gradient
recursion of Eq. (6), we have:

$$\begin{aligned}
g_{\mathrm{dual}}(y_*) - g_{\mathrm{dual}}(y_t) &\leq \frac{2R^2}{\mu(t+3)}, \\
\min_{u \in \{0, \dots, t-1\}} \mathrm{gap}(x_u, y_u) &\leq \frac{4R^2}{\mu(t+3)}.
\end{aligned}$$

Proof The proof is essentially the same as the previous one, with a different application
of Lemma 1 (see end of this section). ■
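
The adaptive step size of Proposition 4 only requires the current duality gap. The sketch
below illustrates it on the same kind of assumed instance as before
($h(x) = \frac{\mu}{2}\|x\|^2$, averaged hinge loss with labels `b`); helper names, toy data
and the brute-force evaluation of $R^2$ over the vertices of $[-1/n, 1/n]^n$ are hypothetical.
By the inequality established in the proof of Proposition 3, the dual objective cannot
decrease with this choice of $\rho_t$, which the sketch records.

```python
from itertools import product


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def primal(A, b, mu, x):
    z = matvec(A, x)
    loss = sum(max(1.0 - bi * zi, 0.0) for zi, bi in zip(z, b)) / len(b)
    return 0.5 * mu * sum(v * v for v in x) + loss


def dual(A, b, mu, y):
    w = rmatvec(A, y)
    return -sum(v * v for v in w) / (2.0 * mu) - sum(bi * yi for bi, yi in zip(b, y))


def cg_line_search(A, b, mu, R2, y0, num_steps):
    """Generalized conditional gradient with rho_t = min(mu * gap / R^2, 1)."""
    n, y = len(b), list(y0)
    duals, gaps = [], []
    for _ in range(num_steps):
        x = [-v / mu for v in rmatvec(A, y)]           # x_{t-1} = (h*)'(-A^T y_{t-1})
        d = dual(A, b, mu, y)
        g = primal(A, b, mu, x) - d
        duals.append(d)
        gaps.append(g)
        z = matvec(A, x)
        y_bar = [-bi / n if 1.0 - bi * zi > 0.0 else 0.0 for zi, bi in zip(z, b)]
        # Clamp at 0 to guard against a tiny negative gap caused by rounding.
        rho = min(max(mu * g / R2, 0.0), 1.0)
        y = [(1.0 - rho) * yi + rho * yb for yi, yb in zip(y, y_bar)]
    duals.append(dual(A, b, mu, y))
    return y, duals, gaps


# Hypothetical toy data: n = 3 observations, p = 2 variables.
A = [[1.0, 2.0], [-1.5, 0.5], [0.3, -0.7]]
b = [1.0, -1.0, 1.0]
mu, T = 0.5, 200
n = len(b)
R2 = max(
    sum(v * v for v in rmatvec(A, [s / n for s in signs]))
    for signs in product((-1.0, 1.0), repeat=n)
)
y_final, duals, gaps = cg_line_search(A, b, mu, R2, [0.0] * n, T)
print(gaps[0], min(gaps), duals[-1])
```

Through the equivalence of Proposition 2, the same rule can be read as an adaptive step size
for the primal mirror descent recursion.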



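Lemma 1 Assume that we have three sequences $(u_t)_{t \geq 0}$, $(v_t)_{t \geq 0}$, and
$(\rho_t)_{t \geq 0}$, and a positive constant $A$ such that

$$\begin{aligned}
&\forall t \geq 0, \quad \rho_t \in [0, 1], \\
&\forall t \geq 0, \quad 0 \leq u_t \leq v_t, \\
&\forall t \geq 1, \quad u_t \leq u_{t-1} - \rho_t v_{t-1} + \frac{A}{2} \rho_t^2.
\end{aligned}$$

- If $\rho_t = 2/(t+1)$, then $u_t \leq \frac{2A}{t+1}$ and for all $t \geq 1$, there exists at
  least one $k \in \{\lfloor t/2 \rfloor, \dots, t\}$ such that $v_k \leq \frac{8A}{t+1}$.

- If $\rho_t = \arg\min_{\rho \in [0, 1]} \; -\rho\, v_{t-1} + \frac{A}{2} \rho^2 = \min\{v_{t-1}/A, 1\}$,
  then $u_t \leq \frac{2A}{t+3}$ and for all $t \geq 2$, there exists at least one
  $k \in \{\lfloor t/2 \rfloor - 1, \dots, t\}$ such that $v_k \leq \frac{4A}{t+3}$.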



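Proof In the first case (non-adaptive sequence $\rho_t$), we have $\rho_1 = 1$ and
$u_t \leq (1 - \rho_t) u_{t-1} + \frac{A}{2} \rho_t^2$, leading to

$$u_t \leq \frac{A}{2} \sum_{u=1}^t \prod_{s=u+1}^t (1 - \rho_s)\, \rho_u^2.$$

For $\rho_t = \frac{2}{t+1}$, this leads to

$$u_t \leq \frac{A}{2} \sum_{u=1}^t \frac{u(u+1)}{t(t+1)} \frac{4}{(u+1)^2} \leq \frac{2A}{t+1}.$$

Moreover, for any $k < j$, by summing $u_t \leq u_{t-1} - \rho_t v_{t-1} + \frac{A}{2} \rho_t^2$
for $t \in \{k+1, \dots, j\}$, we get

$$u_j \leq u_k - \sum_{t=k+1}^j \rho_t v_{t-1} + \frac{A}{2} \sum_{t=k+1}^j \rho_t^2.$$

Thus, if we assume that all $v_{t-1} \geq \beta$ for $t \in \{k+1, \dots, j\}$, then

$$\beta \sum_{t=k+1}^j \rho_t \leq u_k + \frac{A}{2} \sum_{t=k+1}^j \rho_t^2
 \leq \frac{2A}{k+1} + 2A \sum_{t=k+1}^j \frac{1}{(t+1)^2}
 \leq \frac{2A}{k+1} + 2A \sum_{t=k+1}^j \Big[ \frac{1}{t} - \frac{1}{t+1} \Big]
 \leq \frac{4A}{k+1}.$$

Moreover, $\sum_{t=k+1}^j \rho_t = 2 \sum_{t=k+1}^j \frac{1}{t+1} \geq 2\, \frac{j-k}{j+1}$. Thus

$$\beta \leq \frac{2A}{k+1} \frac{j+1}{j-k}.$$

Using $j = t+1$ and $k = \lfloor t/2 \rfloor$, we obtain that $\beta \leq \frac{8A}{t+1}$ (this
can be done by considering the two cases $t$ even and $t$ odd) and thus
$\min_{u \in \{\lfloor t/2 \rfloor, \dots, t\}} v_u \leq \frac{8A}{t+1}$.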
We now consider the line search case:

- If $v_{t-1} \leq A$, then $\rho_t = \frac{v_{t-1}}{A}$, and we obtain
  $u_t \leq u_{t-1} - \frac{v_{t-1}^2}{2A}$.

- If $v_{t-1} \geq A$, then $\rho_t = 1$, and we obtain
  $u_t \leq u_{t-1} - v_{t-1} + \frac{A}{2} \leq u_{t-1} - \frac{v_{t-1}}{2}$.

Putting all this together, we get
$u_t \leq u_{t-1} - \frac{1}{2} \min\{v_{t-1}, v_{t-1}^2/A\}$. This implies that $(u_t)$ is a
decreasing sequence. Moreover, $u_1 \leq \frac{A}{2}$, thus $u_1 \leq \min\{u_0, A/2\} \leq A$.
We then obtain for all $t > 1$, $u_t \leq u_{t-1} - \frac{1}{2A} u_{t-1}^2$. From which we
deduce $u_t^{-1} \geq u_{t-1}^{-1} + \frac{1}{2A}$. We can now sum these inequalities to get
$u_t^{-1} \geq u_1^{-1} + \frac{t-1}{2A}$, that is,

$$u_t \leq \frac{1}{u_1^{-1} + \frac{t-1}{2A}} \leq \frac{1}{\max\{u_0^{-1}, 2/A\} + \frac{t-1}{2A}} \leq \frac{2A}{t+3}.$$

Moreover, if we assume that all $v_{t-1} \geq \beta$ for $t \in \{k+1, \dots, j\}$, following
the same reasoning as above, then

$$\frac{1}{2} \min\{\beta, \beta^2/A\}\,(j - k) \leq u_k \leq \frac{2A}{k+3}.$$

Using $j = t+1$ and $k = \lfloor t/2 \rfloor - 1$, we have
$(k+3)(j-k) > \frac{1}{4}(t+3)^2$ (which can be checked by considering the two cases $t$ even
and $t$ odd). Thus, we must have $\beta \leq A$ (otherwise we obtain
$\beta < 16A/(t+3)^2$, which is a contradiction), and thus $\beta^2 < 16A^2/(t+3)^2$, which
leads to the desired result.
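
The bounds on $u_t$ in Lemma 1 can be sanity-checked on the extreme case where the recursion
holds with equality and $v_t = u_t$. The sketch below is only such a numerical check (it is
not a proof); the function name and the values of `A`, `u0` and `T` are arbitrary assumptions.

```python
def worst_case_sequences(A, u0, num_steps):
    """Iterate u_t = u_{t-1} - rho_t v_{t-1} + (A/2) rho_t^2 with v_{t-1} = u_{t-1}."""
    fixed, adaptive = [u0], [u0]
    for t in range(1, num_steps + 1):
        # Non-adaptive step size rho_t = 2 / (t + 1).
        rho = 2.0 / (t + 1)
        u = fixed[-1]
        fixed.append(u - rho * u + 0.5 * A * rho * rho)
        # Adaptive step size rho_t = min(v_{t-1} / A, 1).
        v = adaptive[-1]
        rho = min(v / A, 1.0)
        adaptive.append(v - rho * v + 0.5 * A * rho * rho)
    return fixed, adaptive


A, u0, T = 3.0, 15.0, 1000   # arbitrary constants for the check
fixed, adaptive = worst_case_sequences(A, u0, T)

tol = 1e-12
fixed_ok = all(fixed[t] <= 2.0 * A / (t + 1) + tol for t in range(1, T + 1))
adaptive_ok = all(adaptive[t] <= 2.0 * A / (t + 3) + tol for t in range(1, T + 1))
print(fixed_ok, adaptive_ok)
```

In this extreme case the adaptive sequence meets the bound $2A/(t+3)$ with equality at
$t = 1$, which is why a small tolerance is used in the comparison.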
5 Discussion

The equivalence shown in Prop. 2 has several interesting consequences and leads to several
additional related questions:

- Primal-dual guarantees: Having a primal-dual interpretation directly leads to
  primal-dual certificates, with a gap that converges at the same rate $O(1/t)$ (see [8] for
  similar results for the regular conditional gradient method). These certificates may
  either be taken to be the pair $(x_t, y_t)$, in which case we have shown that after $t$
  iterations, at least one of the iterates has the guarantee.

  Alternatively, for the fixed step-size $\rho_t = \frac{2}{t+1}$, we can use the same dual
  candidate $y_t = \frac{2}{t(t+1)} \sum_{u=1}^t u\, \bar{y}_{u-1}$ (which can thus also be
  expressed as an average of subgradients) and the averaged primal iterate
  $\frac{2}{t(t+1)} \sum_{u=1}^t u\, x_{u-1}$. Thus, the two weighted averages of subgradients
  lead to primal-dual certificates.

- Line-search for mirror descent: Prop. 4 provides a form of line search for mirror
  descent (i.e., an adaptive step size). Note the similarity with Polyak's rule (see,
  e.g., [19]).

- Absence of logarithmic terms: Note that we have considered a step-size of
  $\frac{2}{t+1}$, which avoids a logarithmic term of the form $\log t$ in all bounds (which
  would be the case for $\rho_t = \frac{1}{t}$). This also applies to the stochastic case [20].

- Properties of iterates: While we have focused primarily on the convergence rates
  of the iterates and their objective values, recent work has shown that the iterates
  themselves could have interesting distributional properties [21, 22], which would be
  worth further investigating.

- Stochastic approximation and online learning: There are potentially other exchanges
  between primal/dual formulations, in particular in the stochastic setting (see,
  e.g., [23]).

- Simplicial methods and cutting-planes: The duality between subgradient and
  conditional gradient may be extended to algorithms with iterations that are more
  expensive. For example, simplicial methods in the dual are equivalent to cutting-planes
  methods in the primal (see, e.g., [24]).

- Conditional gradient algorithms for penalized problems: Another interesting
  example for machine learning is more naturally described from the dual formulation:
  given a smooth loss term $h^*(-A^\top y)$ (this could be least-squares or logistic
  regression), a typically non-smooth penalization is added, often in the form of a constant
  times a norm, i.e., $f^*(y) = \lambda \Omega(y)$. When the proximal operator for the norm
  $\Omega$ is easy to compute, then the minimization of $h^*(-A^\top y) + f^*(y)$ may readily
  be done through proximal methods [10, 11]. However, in some situations, the only efficient
  operation on the norm $\Omega$ is the maximization of linear functions on the unit ball.

  Conditional gradient algorithms are applicable to functions $f^*$ with a compact domain
  and are thus adapted to constrained problems where $f^*$ would be the indicator
  function of the ball $\{y \in \mathbb{R}^n, \ \Omega(y) \leq \nu\}$. However, the penalized
  problem defined above does not satisfy the compactness assumption and an extension has
  been recently proposed in [25]: given a (potentially loose) bound $\nu$ on an optimal
  solution, a line-search-based algorithm is derived that leads to a convergence rate of
  $O(1/t)$, with proportionality constant independent of $\nu$. A simpler algorithm that does
  not exhibit this property may be obtained by considering the function defined as
  $f^*(y) = \lambda \Omega(y)$ for $\Omega(y) \leq \nu$ and $+\infty$ otherwise, which does
  have a compact domain, and applying the generalized conditional gradient algorithms
  described in Section 4.